
Advancements in Tiled-Based 
Compute Rendering 


Gareth Thomas 

Developer Technology Engineer, AMD 


GAME DEVELOPERS CONFERENCE 

MOSCONE CENTER ■ SAN FRANCISCO, CA 
MARCH 2-6, 2015 • EXPO: MARCH 4-6, 2015 





Agenda 

•Current Tech 
•Culling Improvements 
•Clustered Rendering 
•Summary 


MARCH 2-6, 2015 GDCONF.COM 





Proven Tech - Out in the Wild 



Tiled Deferred [Andersson09] 

•Frostbite 
•UE4 
•Ryse

Forward+ [Harada et al 12]

•DiRT & GRID Series 
•The Order: 1886 
•Ryse

Tiled Rendering 101


Use z buffer from depth pre-pass as input

Find min and max depth per tile

Use this frustum for intersection testing

•Z Prepass (on Forward+)

•Depth bounds 
•Light Culling 
•Color Pass

Depth Bounds

• Determine min and max bounds of the depth buffer on a per tile basis

• Atomic Min Max [Andersson09]

groupshared uint ldsZMin;
groupshared uint ldsZMax;

[numthreads(16, 16, 1)]
void CalculateDepthBoundsCS( uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID )
{
    uint localIdxFlattened = localIdx.x + localIdx.y*16;

    if( localIdxFlattened == 0 )
    {
        ldsZMin = 0x7f7fffff;  // FLT_MAX as a uint
        ldsZMax = 0;
    }

    GroupMemoryBarrierWithGroupSync();

    float depth = g_DepthTexture.Load( uint3(globalIdx.x, globalIdx.y, 0) ).x;

    uint z = asuint( ConvertProjDepthToView( depth ) );  // reinterpret as uint

    if( depth != 0.f )
    {
        InterlockedMax( ldsZMax, z );  // atomic min & max
        InterlockedMin( ldsZMin, z );
    }

    GroupMemoryBarrierWithGroupSync();

    float maxZ = asfloat( ldsZMax );  // reinterpret back to float
    float minZ = asfloat( ldsZMin );

    // minZ and maxZ now bound the tile's view-space depth range
}
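The listing above leans on one property: the bit pattern of a non-negative IEEE 754 float, read as an unsigned integer, sorts in the same order as the float itself, which is what lets integer atomics stand in for a float min/max. A minimal CPU-side C++ sketch of that property follows. `AsUint`, `AsFloat` and `TileDepthBounds` are hypothetical names standing in for HLSL's `asuint`/`asfloat` and the per-tile loop; the skip for `depth == 0` is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-ins for HLSL asuint()/asfloat(): reinterpret the bits, no conversion.
inline uint32_t AsUint(float f)     { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
inline float    AsFloat(uint32_t u) { float f;    std::memcpy(&f, &u, sizeof f); return f; }

struct DepthBounds { float minZ; float maxZ; };

// Emulates the per-tile atomic min/max by comparing uint bit patterns.
// Only valid for non-negative view-space depths.
inline DepthBounds TileDepthBounds(const std::vector<float>& viewZ)
{
    uint32_t zMin = 0x7f7fffffu; // FLT_MAX as a uint
    uint32_t zMax = 0u;
    for (float z : viewZ) {
        zMin = std::min(zMin, AsUint(z));
        zMax = std::max(zMax, AsUint(z));
    }
    return { AsFloat(zMin), AsFloat(zMax) };
}
```

The ordering breaks for negative values, so the trick assumes view-space depth is stored as a positive distance.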
Parallel Reduction

•Atomics are useful but not efficient
•Compute-friendly algorithm
•Great material already available:

•"Optimizing Parallel Reduction in CUDA" [Harris07]

•"Compute Shader Optimizations for AMD GPUs: Parallel Reduction" [Engel14]

[Figure: parallel reduction over depth[tid]. Successive steps: min(depth[tid], depth[tid+8]), min(depth[tid], depth[tid+4]), min(depth[tid], depth[tid+2]), depth[tid] = min(depth[tid], depth[tid+1])]
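The four steps in the figure can be emulated on the CPU to make the access pattern concrete. This is a minimal C++ sketch, assuming 16 values and 16 threads in lockstep as in the diagram; `ReduceMin16` is a hypothetical name, and the inner loop plays the role of the threads.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Tree reduction over 16 values, emulating 16 threads running in lockstep.
// The stride halves each step (8, 4, 2, 1); afterwards depth[0] holds the minimum.
inline float ReduceMin16(std::array<float, 16> depth)
{
    for (std::size_t stride = 8; stride > 0; stride /= 2) {
        // Each "thread" tid < stride combines its value with the one stride away.
        // Reads touch only indices >= stride, which no thread writes in this step.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            depth[tid] = std::min(depth[tid], depth[tid + stride]);
        }
    }
    return depth[0];
}
```

Sixteen values are reduced in four steps rather than sixteen serialized atomic operations, which is the appeal of the approach on a GPU.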
Implementation details 

•First pass reads 4 depth samples 
•Needs to be separate pass 
•Write bounds to UAV 

•Maybe useful for other things too

groupshared float ldsZMin[64];
groupshared float ldsZMax[64];

[numthreads(8, 8, 1)]
void CalculateDepthBoundsCS( uint3 globalIdx : SV_DispatchThreadID, uint3 localIdx : SV_GroupThreadID, uint3 groupIdx : SV_GroupID )
{
    // Each thread reads a 2x2 block of depth samples
    uint2 sampleIdx = globalIdx.xy*2;

    float depth00 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x,   sampleIdx.y,   0)).x;
    float depth01 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x,   sampleIdx.y+1, 0)).x;
    float depth10 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x+1, sampleIdx.y,   0)).x;
    float depth11 = g_SceneDepthBuffer.Load(uint3(sampleIdx.x+1, sampleIdx.y+1, 0)).x;

    float viewPosZ00 = ConvertProjDepthToView(depth00);
    float viewPosZ01 = ConvertProjDepthToView(depth01);
    float viewPosZ10 = ConvertProjDepthToView(depth10);
    float viewPosZ11 = ConvertProjDepthToView(depth11);

    // Samples with depth == 0 are excluded from both bounds
    float minZ00 = (depth00 != 0.f) ? viewPosZ00 : FLT_MAX;
    float maxZ00 = (depth00 != 0.f) ? viewPosZ00 : 0.0f;

    float minZ01 = (depth01 != 0.f) ? viewPosZ01 : FLT_MAX;
    float maxZ01 = (depth01 != 0.f) ? viewPosZ01 : 0.0f;

    float minZ10 = (depth10 != 0.f) ? viewPosZ10 : FLT_MAX;
    float maxZ10 = (depth10 != 0.f) ? viewPosZ10 : 0.0f;

    float minZ11 = (depth11 != 0.f) ? viewPosZ11 : FLT_MAX;
    float maxZ11 = (depth11 != 0.f) ? viewPosZ11 : 0.0f;

    uint threadNum = localIdx.x + localIdx.y*8;

    ldsZMin[threadNum] = min(minZ00, min(minZ01, min(minZ10, minZ11)));
    ldsZMax[threadNum] = max(maxZ00, max(maxZ01, max(maxZ10, maxZ11)));

    GroupMemoryBarrierWithGroupSync();

    // No barriers between the steps below: this relies on the 32 active
    // threads executing in lockstep
    if (threadNum < 32)
    {
        ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum+32]);
        ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum+16]);
        ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum+8]);
        ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum+4]);
        ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum+2]);
        ldsZMin[threadNum] = min(ldsZMin[threadNum], ldsZMin[threadNum+1]);

        ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum+32]);
        ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum+16]);
        ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum+8]);
        ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum+4]);
        ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum+2]);
        ldsZMax[threadNum] = max(ldsZMax[threadNum], ldsZMax[threadNum+1]);
    }

    GroupMemoryBarrierWithGroupSync();

    if (threadNum == 0)
    {
        g_DepthBounds[groupIdx.xy] = float2(ldsZMin[0], ldsZMax[0]);
    }
}

Parallel Reduction - Performance


                  Atomic Min/Max    Parallel Reduction

AMD R9 290X       1.8ms             1.60ms

NVIDIA GTX 980    1.8ms             1.54ms


Combined cost of depth bounds and light culling of 2048 lights at 3840x2160 

Parallel reduction pass takes ~0.35ms 

Faster than Atomic Min/Max on the GPUs tested

Light Culling: The Intersection Test

Sphere-Frustum Test

[Figure: AABB around long frustum; AABB around short frustum]

Arvo Intersection Test [Arvo90]

bool TestSphereVsAABB(float3 sphereCenter, float sphereRadius, float3 AABBCenter, float3 AABBHalfSize)
{
    // Per-axis distance from the sphere centre to the box, zero where the centre is inside the box extent
    float3 delta = max(0, abs(AABBCenter - sphereCenter) - AABBHalfSize);

    float distSq = dot(delta, delta);

    return distSq <= sphereRadius * sphereRadius;
}

[Figure: Single Point Light]

[Figure: Frustum/Sphere Test]

[Figure: Arvo AABB/Sphere Test]

Culling Spot Lights


•Don't put bounding 
sphere around spot light 
origin 

•Tightly bound spot light inside sphere at P with radius r

[Figure: spot light cone inside its bounding sphere, with the spot position labelled]
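The slide does not spell out how P and r are chosen. One common construction, sketched below in C++ as an assumption rather than as the talk's method, treats the spot light as a cone whose edge length equals the light range: for half-angles up to 45 degrees the sphere passes through the apex and the rim, and for wider cones it is centred on the rim circle. `Vec3`, `Sphere` and `BoundSpotLight` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

struct Vec3   { float x, y, z; };
struct Sphere { Vec3 center; float radius; };

// origin: spot position, dir: normalized spot direction,
// range: spot range, halfAngle: cone half-angle in radians (0 < halfAngle < pi/2).
// Assumed construction; the source slide only names P and r.
inline Sphere BoundSpotLight(Vec3 origin, Vec3 dir, float range, float halfAngle)
{
    const float kQuarterPi = 0.78539816f;
    float offset, radius;
    if (halfAngle > kQuarterPi) {
        // Wide cone: centre the sphere on the rim circle of the cone.
        offset = range * std::cos(halfAngle);
        radius = range * std::sin(halfAngle);
    } else {
        // Narrow cone: sphere through the apex and the rim.
        offset = range / (2.0f * std::cos(halfAngle));
        radius = offset;
    }
    return { { origin.x + dir.x * offset,
               origin.y + dir.y * offset,
               origin.z + dir.z * offset }, radius };
}
```

In both cases the radius is smaller than the light range, so the sphere is tighter than one of radius equal to the range centred on the spot origin.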
Depth Discontinuities

[Figure: Scene Geometry]

False Positives

2.5D Culling [Harada et al 12]

[Figure: Scene Geometry]

HalfZ

[Figure: Scene Geometry with MinZ, HalfZ and MaxZ marked]



HalfZ low bits 
HalfZ high bits 
numLights near side 
numLights far side 
light indices... 



16 bit light index buffer 

size: maxLightsPerTile x 2 + 4

Modified HalfZ

•Calculate Min & Max Z as normal

•Calculate HalfZ

•Calculate a second set of Min and Max values using HalfZ as the max & min respectively (MaxZ2 on the near side of HalfZ, MinZ2 on the far side)

•Test against near bounds and far bounds
•Either write to one list

•Or write to two lists cf. HalfZ

•Doubles the work in the depth bounds pass
•Worst case converges on HalfZ

[Figure: tile depth range labelled, from far to near: MaxZ, MinZ2, HalfZ, MaxZ2, MinZ]
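The bounds above can be sketched on the CPU as follows. This is a minimal C++ illustration, assuming the near range is [MinZ, MaxZ2] and the far range is [MinZ2, MaxZ] as the figure labels suggest; the struct and function names are hypothetical, and the input is assumed to hold only valid (non-sky) view-space depths.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <vector>

struct ModifiedHalfZBounds {
    float minZ, maxZ;   // full tile range
    float halfZ;        // midpoint of the full range
    float maxZ2;        // farthest depth on the near side of halfZ
    float minZ2;        // nearest depth on the far side of halfZ
};

// viewZ: view-space depths of the tile's valid pixels (assumption: sky already removed).
inline ModifiedHalfZBounds ComputeModifiedHalfZ(const std::vector<float>& viewZ)
{
    ModifiedHalfZBounds b{ FLT_MAX, 0.0f, 0.0f, 0.0f, FLT_MAX };
    for (float z : viewZ) {
        b.minZ = std::min(b.minZ, z);
        b.maxZ = std::max(b.maxZ, z);
    }
    b.halfZ = 0.5f * (b.minZ + b.maxZ);
    // Second pass over the same samples: this is the doubled depth bounds work.
    for (float z : viewZ) {
        if (z <= b.halfZ) b.maxZ2 = std::max(b.maxZ2, z);
        if (z >= b.halfZ) b.minZ2 = std::min(b.minZ2, z);
    }
    return b;
}

// A light spanning [lightMinZ, lightMaxZ] in view space is kept if it overlaps either sub-range.
inline bool OverlapsTileDepth(const ModifiedHalfZBounds& b, float lightMinZ, float lightMaxZ)
{
    bool nearHit = lightMaxZ >= b.minZ  && lightMinZ <= b.maxZ2;
    bool farHit  = lightMaxZ >= b.minZ2 && lightMinZ <= b.maxZ;
    return nearHit || farHit;
}
```

For a tile with depths clustered at 1 to 1.2 and 9 to 10, a light spanning 4 to 6 is rejected, whereas a plain MinMax range of 1 to 10 would keep it. When geometry is continuous across the tile, MaxZ2 and MinZ2 both approach HalfZ and the gap closes.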
Sponza Atrium + 1 million sub pixel triangles

[Figure: MinMax depth bounds, Frustum culling]

[Figure: MinMax depth bounds, AABB culling]

[Figure: MinMax depth bounds, Hybrid culling (AABB + Frustum sides)]

[Figure: Modified HalfZ depth bounds, AABB culling]

Unreal Engine 4, Infiltrator Demo

[Figure: MinMax Depth Bounds]

[Figure: Modified HalfZ in one light list]

[Chart: Color Pass Time - HalfZ vs 2.5D vs Modified HalfZ @ 4k, R9 290X. Series: MinMax, Frustum; HalfZ, AABB; 2.5D, Hybrid; MHZ, AABB]

[Chart: Culling Time - MinMax vs 2.5D vs Modified HalfZ @ 4k, R9 290X. Series: MinMax, Frustum; MHZ, AABB; 2.5D, Hybrid]

What happens if we cull 32x32 tiles?

Still using 16x16 thread groups

[Chart: Modified HalfZ - 16x16 Tiles vs 32x32 Tiles @ 4k, R9 290X. X axis: Num Lights. Series: 16x16 Total Time; 32x32 Total Time; 16x16 Color Pass; 32x32 Color Pass; 16x16 AABB Culling; 32x32 Hybrid Culling]

Culling Conclusion

•Modified HalfZ with AABBs generally works best 

•Even though generating MinZ2 and MaxZ2 adds a little cost 
•Even though each light is culled against two AABBs instead of one

•32x32 tiles save a good chunk of time in the culling stage

•...at the cost of color pass efficiency when pushing larger numbers of lights

Clustered Rendering [Olsson et al 12]


•Production proven in Forza Horizon 2 


•Additional benefits on top of 2D 
culling: 

•No mandatory Z prepass 

•Just works™ for transparencies and 
volumetric effects 


•Can a further reduction in lights per pixel improve performance?

Clustered Rendering 101

•Divide up Z axis exponentially

•Start at some sensible near slice

•Cap at some sensible value
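An exponential split can be written as a logarithmic mapping from view-space depth to slice index. The C++ sketch below is one way to do it, assuming slice boundaries at nearZ * (farZ / nearZ)^(i / numSlices); the function names and the choice to fold depths beyond the cap into the last slice are assumptions, not taken from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Maps a view-space depth to a slice index, with slice boundaries spaced
// exponentially between nearZ (start of slice 0) and farZ (the cap).
inline int DepthToSlice(float viewZ, float nearZ, float farZ, int numSlices)
{
    if (viewZ <= nearZ) return 0;
    float t = std::log(viewZ / nearZ) / std::log(farZ / nearZ); // 0 at nearZ, 1 at farZ
    int slice = static_cast<int>(t * static_cast<float>(numSlices));
    return std::min(slice, numSlices - 1); // depths past the cap land in the last slice
}

// Near boundary of a slice; the inverse of the mapping above.
inline float SliceNearZ(int slice, float nearZ, float farZ, int numSlices)
{
    return nearZ * std::pow(farZ / nearZ,
                            static_cast<float>(slice) / static_cast<float>(numSlices));
}
```

With nearZ = 1, farZ = 16 and 4 slices, the boundaries fall at 1, 2, 4, 8 and 16, so each slice is twice as deep as the one before it.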
Provision for far lights 


• Fade them out 

• Drop back to glares 

• Prebake

Light Culling

•View space AABBs worked best on 2D grid

•But bad when running, say, 16 slices

•View space frustum planes are better

•Calculate per tile planes
•Then test each slice near and far
•Optionally, then test AABBs
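The plane-based steps above can be sketched as follows in C++. This is a minimal illustration under stated assumptions: the four side planes pass through the eye with unit normals pointing into the tile, slices are given as view-space Z ranges, and the result is a bitmask of touched slices. All type and function names are hypothetical, and the optional AABB refinement is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Vec3       { float x, y, z; };
struct Plane      { Vec3 n; };            // plane through the eye; n is unit length, pointing into the tile
struct SliceRange { float nearZ, farZ; }; // view-space Z bounds of one slice

inline float Dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns a bitmask with bit i set if the sphere may touch slice i of this tile.
inline uint32_t CullSphereAgainstTile(Vec3 center, float radius,
                                      const Plane (&sides)[4],
                                      const std::vector<SliceRange>& slices)
{
    // Step 1: the four per-tile side planes, shared by every slice.
    for (const Plane& p : sides)
        if (Dot(p.n, center) < -radius) return 0u;

    // Step 2: each slice's near and far bounds along view-space Z.
    uint32_t mask = 0u;
    for (std::size_t i = 0; i < slices.size() && i < 32; ++i)
        if (center.z + radius >= slices[i].nearZ && center.z - radius <= slices[i].farZ)
            mask |= (1u << i);
    return mask;
}
```

The side planes are tested once per tile and light, and only the cheap Z interval test is repeated per slice, which is why this scales better with slice count than testing an AABB per cluster.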
VRAM Usage

• 16x16 pixel 2D grid requires numTilesX x numTilesY x maxLights

• 1080p: 120x68x512xuint16 = 8MB
•4k: 240x135x512xuint16 = 32MB
•List for each light type (points & spots): 64MB
•So 32 slices: 1GB for point lights only

•Either use coarser grid
•Or use a compacted list
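The figures above follow from a single product. A small C++ helper reproduces them; `LightListBytes` is a hypothetical name, and rounding partial tiles up is an assumption that matches the 68 rows quoted for 1080p.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for an uncompacted light index list: one uint16 index per
// potential light, per cell. tileSize is the tile edge in pixels.
inline uint64_t LightListBytes(uint32_t width, uint32_t height, uint32_t tileSize,
                               uint32_t numSlices, uint32_t maxLights)
{
    uint64_t tilesX = (width  + tileSize - 1) / tileSize; // round up partial tiles
    uint64_t tilesY = (height + tileSize - 1) / tileSize;
    return tilesX * tilesY * numSlices * maxLights * sizeof(uint16_t);
}
```

`LightListBytes(1920, 1080, 16, 1, 512)` gives 8,355,840 bytes (about 8MB), `LightListBytes(3840, 2160, 16, 1, 512)` gives 33,177,600 bytes (about 32MB), and raising `numSlices` to 32 at 4k gives 1,061,683,200 bytes, roughly 1GB for a single light type.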
Compacted List

•Option 1:

• Do all culling on CPU [Olsson et al 12] [Persson13] [Dufresne14]
•But some of the lights may be spawned by the GPU
•My CPU is a precious resource!

• Option 2:

•Cull on GPU

•Keep track of how many lights per slice in TGSM
•Write table of offsets in light list header
•Only need maxLights x "safety factor" per tile
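Option 2 amounts to a prefix sum over the per-slice light counts. The C++ sketch below builds such a list for a single tile on the CPU. The exact layout (a header of numSlices + 1 offsets followed by the indices) is an assumption chosen for illustration; the slides only state that a table of offsets is written to the light list header. `BuildCompactedTileList` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a compacted list for one tile: a header of per-slice start offsets
// (numSlices + 1 entries) followed by the light indices grouped by slice.
// perSlice[s] holds the indices of the lights that survived culling for slice s.
inline std::vector<uint16_t> BuildCompactedTileList(
    const std::vector<std::vector<uint16_t>>& perSlice)
{
    const std::size_t numSlices = perSlice.size();
    std::vector<uint16_t> list(numSlices + 1);
    uint16_t offset = static_cast<uint16_t>(numSlices + 1); // indices start after the header
    for (std::size_t s = 0; s < numSlices; ++s) {
        list[s] = offset;                                    // where slice s begins
        offset = static_cast<uint16_t>(offset + perSlice[s].size());
    }
    list[numSlices] = offset;                                // one past the last index
    for (const auto& lights : perSlice)
        list.insert(list.end(), lights.begin(), lights.end());
    return list;
}
```

A shader reading slice s would loop from `list[s]` to `list[s + 1]`. Empty slices cost only their header entry, which is where the saving over a fixed maxLights allocation per slice comes from.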
Coarse Grid

Example:

•4k resolution
•64x64 pixel tiles with 64 slices
•maxLights = 512

•60 x 34 tiles x 64 slices x 512 x uint16 = 128MB

[Chart: Total Time - Modified HalfZ vs Clustered @ 4k, R9 290X. X axis: Num Lights. Series: Modified HalfZ; Clustered 32x32x32; Clustered 64x64x128; Clustered 64x64x32]

[Chart: Color Pass Time - Modified HalfZ vs Clustered @ 4k, R9 290X. X axis: Num Lights. Series: Modified HalfZ; Clustered 64x64x32; Clustered 32x32x32; Clustered 64x64x128]

[Chart: Culling Time - Modified HalfZ vs Clustered @ 4k, R9 290X. X axis: Num Lights (256, 512, 1024, 2048, 4096). Series: Modified HalfZ; Clustered 64x64x128; Clustered 32x32x32; Clustered 64x64x32]

Z Prepass

•Very scene dependent

•Often considered too expensive

•DirectX 12 can help with draw submission cost

•Should already have a super optimized depth only path for shadows!

• Position only streams

• Index buffer to batch materials together

•A partial prepass can really help lighten the geometry load

Conclusions

• Parallel Reduction - faster than atomic min/max 

•AABB-Sphere test in conjunction with Modified HalfZ is a 
good choice 

•Clustered shading 

•Potentially a big saving on the tile culling 
•Less overhead for low light numbers 
•Offers other benefits over 2D tiling 
•Aggressive culling is very worthwhile 

•The best optimisation for your expensive color scene